emphatic weighting
- North America > Canada > Alberta (0.15)
- North America > United States > Massachusetts > Hampshire County > Amherst (0.04)
- North America > Canada > Quebec > Montreal (0.04)
An Off-policy Policy Gradient Theorem Using Emphatic Weightings
Policy gradient methods are widely used for control in reinforcement learning, particularly for the continuous action setting. There have been a host of theoretically sound algorithms proposed for the on-policy setting, due to the existence of the policy gradient theorem which provides a simplified form for the gradient. In off-policy learning, however, where the behaviour policy is not necessarily attempting to learn and follow the optimal policy for the given task, the existence of such a theorem has been elusive. In this work, we solve this open problem by providing the first off-policy policy gradient theorem. The key to the derivation is the use of emphatic weightings. We develop a new actor-critic algorithm--called Actor Critic with Emphatic weightings (ACE)--that approximates the simplified gradients provided by the theorem. We demonstrate in a simple counterexample that previous off-policy policy gradient methods--particularly OffPAC and DPG--converge to the wrong solution whereas ACE finds the optimal solution.
- North America > Canada > Alberta (0.15)
- North America > United States > Massachusetts > Hampshire County > Amherst (0.04)
- North America > Canada > Quebec > Montreal (0.04)
Reviews: An Off-policy Policy Gradient Theorem Using Emphatic Weightings
The paper formulates and proves a policy gradient theorem in the off-policy setting. The derivation is based on emphatic weighting of the states. Based on the introduced theorem, an actor-critic algorithm, termed ACE, is further proposed. The algorithm requires computing policy gradient updates that depend on the emphatic weights. Computing low-variance estimates of the weights is non-trivial, and the authors introduce a relaxed version of the weights that interpolate between the off-policy actor-critic (Degris et al., 2012) and the unbiased (but high variance) estimator; the introduced estimator can be computed incrementally.
Off-Policy Actor-Critic with Emphatic Weightings
Graves, Eric, Imani, Ehsan, Kumaraswamy, Raksha, White, Martha
A variety of theoretically-sound policy gradient algorithms exist for the on-policy setting due to the policy gradient theorem, which provides a simplified form for the gradient. The off-policy setting, however, has been less clear due to the existence of multiple objectives and the lack of an explicit off-policy policy gradient theorem. In this work, we unify these objectives into one off-policy objective, and provide a policy gradient theorem for this unified objective. The derivation involves emphatic weightings and interest functions. We show multiple strategies to approximate the gradients, in an algorithm called Actor Critic with Emphatic weightings (ACE). We prove in a counterexample that previous (semi-gradient) off-policy actor-critic methods--particularly Off-Policy Actor-Critic (OffPAC) and Deterministic Policy Gradient (DPG)--converge to the wrong solution whereas ACE finds the optimal solution. We also highlight why these semi-gradient approaches can still perform well in practice, suggesting strategies for variance reduction in ACE. We empirically study several variants of ACE on two classic control environments and an image-based environment designed to illustrate the tradeoffs made by each gradient approximation. We find that by approximating the emphatic weightings directly, ACE performs as well as or better than OffPAC in all settings tested.
- North America > United States > Massachusetts > Hampshire County > Amherst (0.04)
- North America > Canada > Alberta > Census Division No. 11 > Edmonton Metropolitan Region > Edmonton (0.04)
- Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
- Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models (0.67)
Learning Expected Emphatic Traces for Deep RL
Jiang, Ray, Zhang, Shangtong, Chelu, Veronica, White, Adam, van Hasselt, Hado
Off-policy sampling and experience replay are key for improving sample efficiency and scaling model-free temporal difference learning methods. When combined with function approximation, such as neural networks, this combination is known as the deadly triad and is potentially unstable. Recently, it has been shown that stability and good performance at scale can be achieved by combining emphatic weightings and multi-step updates. This approach, however, is generally limited to sampling complete trajectories in order, to compute the required emphatic weighting. In this paper we investigate how to combine emphatic weightings with non-sequential, off-line data sampled from a replay buffer. We develop a multi-step emphatic weighting that can be combined with replay, and a time-reversed $n$-step TD learning algorithm to learn the required emphatic weighting. We show that these state weightings reduce variance compared with prior approaches, while providing convergence guarantees. We tested the approach at scale on Atari 2600 video games, and observed that the new X-ETD($n$) agent improved over baseline agents, highlighting both the scalability and broad applicability of our approach.
- North America > Canada > Quebec > Montreal (0.14)
- North America > Canada > Alberta > Census Division No. 11 > Edmonton Metropolitan Region > Edmonton (0.14)
- Europe > United Kingdom > England > Oxfordshire > Oxford (0.14)
- (4 more...)
An Off-policy Policy Gradient Theorem Using Emphatic Weightings
Imani, Ehsan, Graves, Eric, White, Martha
Policy gradient methods are widely used for control in reinforcement learning, particularly for the continuous action setting. There have been a host of theoretically sound algorithms proposed for the on-policy setting, due to the existence of the policy gradient theorem which provides a simplified form for the gradient. In off-policy learning, however, where the behaviour policy is not necessarily attempting to learn and follow the optimal policy for the given task, the existence of such a theorem has been elusive. In this work, we solve this open problem by providing the first off-policy policy gradient theorem. The key to the derivation is the use of emphatic weightings.
A Multi-Agent Off-Policy Actor-Critic Algorithm for Distributed Reinforcement Learning
Suttle, Wesley, Yang, Zhuoran, Zhang, Kaiqing, Wang, Zhaoran, Basar, Tamer, Liu, Ji
In this work we develop a new off-policy actor-critic algorithm that performs policy improvement with convergence guarantees in the multi-agent setting using function approximation. To achieve this, we extend the method of emphatic temporal differences (ETD(λ)) to the multi-agent setting with provable convergence under linear function approximation, and we also derive a novel off-policy policy gradient theorem for the multi-agent setting. Using these new results, we develop our two-timescale algorithm, which uses ETD(λ) to perform policy evaluation for the critic step at a faster timescale and policy gradient ascent using emphatic weightings for the actor step at a slower timescale. We also provide convergence guarantees for the actor step. Our work builds on recent advances in three main areas: multi-agent on-policy actor-critic methods, emphatic temporal difference learning for off-policy policy evaluation, and the use of emphatic weightings in off-policy policy gradient methods.
- North America > United States > New York > Suffolk County > Stony Brook (0.04)
- North America > United States > Illinois (0.04)
- Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
An Off-policy Policy Gradient Theorem Using Emphatic Weightings
Imani, Ehsan, Graves, Eric, White, Martha
Policy gradient methods are widely used for control in reinforcement learning, particularly for the continuous action setting. There have been a host of theoretically sound algorithms proposed for the on-policy setting, due to the existence of the policy gradient theorem which provides a simplified form for the gradient. In off-policy learning, however, where the behaviour policy is not necessarily attempting to learn and follow the optimal policy for the given task, the existence of such a theorem has been elusive. In this work, we solve this open problem by providing the first off-policy policy gradient theorem. The key to the derivation is the use of emphatic weightings. We develop a new actor-critic algorithm—called Actor Critic with Emphatic weightings (ACE)—that approximates the simplified gradients provided by the theorem. We demonstrate in a simple counterexample that previous off-policy policy gradient methods—particularly OffPAC and DPG—converge to the wrong solution whereas ACE finds the optimal solution.
- North America > Canada > Alberta (0.15)
- North America > United States > Massachusetts > Hampshire County > Amherst (0.04)
- North America > Canada > Quebec > Montreal (0.04)
An Off-policy Policy Gradient Theorem Using Emphatic Weightings
Imani, Ehsan, Graves, Eric, White, Martha
Policy gradient methods are widely used for control in reinforcement learning, particularly for the continuous action setting. There have been a host of theoretically sound algorithms proposed for the on-policy setting, due to the existence of the policy gradient theorem which provides a simplified form for the gradient. In off-policy learning, however, where the behaviour policy is not necessarily attempting to learn and follow the optimal policy for the given task, the existence of such a theorem has been elusive. In this work, we solve this open problem by providing the first off-policy policy gradient theorem. The key to the derivation is the use of emphatic weightings. We develop a new actor-critic algorithm—called Actor Critic with Emphatic weightings (ACE)—that approximates the simplified gradients provided by the theorem. We demonstrate in a simple counterexample that previous off-policy policy gradient methods—particularly OffPAC and DPG—converge to the wrong solution whereas ACE finds the optimal solution.
- North America > Canada > Alberta (0.14)
- North America > United States > Massachusetts > Hampshire County > Amherst (0.04)
- North America > Canada > Quebec > Montreal (0.04)